BREAKING REPLAY DEPENDENCY LOOPS IN A PROCESSOR
USING A RESCHEDULED REPLAY QUEUE 



FIELD OF THE INVENTION 

The invention generally relates to processors, and in particular to processors having 
a replay system for data speculation. 

BACKGROUND 

The primary function of most computer processors is to execute a stream of 
computer instructions that are retrieved from a storage device. Many processors are 
designed to fetch an instruction and execute that instruction before fetching the next 
instruction. With these processors, there is an assurance that any register or memory value 
that is modified or retrieved by a given instruction will be available to instructions following 
it. For example, consider the following set of instructions: 

1: Load memory-1 into register-X;

2: Add register-X register-Y into register-Z;

3: Add register-Y register-Z into register-W.

The first instruction loads the content of memory-1 into register-X. The second instruction
adds the content of register-X to the content of register-Y and stores the result in register-Z. 
The third instruction adds the content of register-Y to the content of register-Z and stores 
the result in register-W. In this set of instructions, instructions 2 and 3 are considered 
dependent instructions because their execution depends on the result of and prior execution 
of instruction 1. In other words, if register-X is not loaded with valid data in instruction 1 
before instructions 2 and 3 are executed, instructions 2 and 3 will generate improper 
results. 

With traditional fetch and execute processors, the second instruction will not be 
executed until the first instruction has properly executed. For example, the second 
instruction may not be dispatched to the processor until a signal is received stating that the 
first instruction has completed execution. Similarly, because the third instruction is 
dependent on the second instruction, the third instruction will not be dispatched until an 
indication that the second instruction has properly executed has been received. Therefore, 
according to the traditional fetch and execute method, this short sequence of instructions 
cannot be executed in less time than $T = L_1 + L_2 + L_3$, where $T$ represents time and $L_1$, $L_2$
and $L_3$ represent the latency of each of the three instructions.
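
By way of illustration only, this lower bound can be sketched in Python as a sum of per-instruction latencies under strictly serial dispatch. The function name and the latency values below are hypothetical and do not come from the specification.

```python
def serial_execution_time(latencies):
    """Minimum time when each instruction waits for its predecessor to complete."""
    return sum(latencies)

# Hypothetical latencies, in clock cycles, for instructions 1, 2 and 3.
latencies = [4, 1, 1]
total = serial_execution_time(latencies)
```

Any speculative scheme aims to finish the same sequence in less than this serial total by overlapping dependent instructions with their producers.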

To increase the speed of processing, later, dependent instructions may be selectively 
chosen and speculatively executed in an effort to anticipate the results needed to increase the 
throughput of the processor and decrease overall execution time. Data speculation may 
involve speculating that data retrieved from a cache memory is valid. Processing proceeds 
on the assumption that data retrieved from the cache is good. However, when the data in
the cache is invalid, the results of the speculatively executed instructions are disregarded,
and the processor backs up to re-execute the instructions that were speculatively executed.
Stated another way, data speculation assumes that data in a cache memory are correct, that
is, that the cache memory contains the result from those instructions on which the present
instruction is dependent. Data speculation may involve speculating that data from the
execution of an instruction on which the present instruction is dependent will be stored in a
location in cache memory such that the data in the cache memory will be valid by the time
the instruction attempts to access the location in cache memory. The dependent instruction
thus relies on a load of the result of the instruction on which it is dependent.
When the load misses the cache, the dependent instruction must be re-executed.

For example, while instruction 1 is executing, instruction 2 may be executed in
parallel such that instruction 2 speculatively accesses the value stored in a location in cache
memory where the result of instruction 1 will be stored. In this way, instruction 2 executes
assuming a cache hit. If the value of the contents of the cache memory is valid, then the
execution of instruction 1 has completed. If instruction 2 is successfully speculatively
executed in advance of the completion of instruction 1, then, rather than execute instruction
2 at the completion of instruction 1, a simple and quick check may be made to confirm that
the speculative execution was successful. In this way, processors increase their execution
speed and throughput by speculatively executing instructions in advance, based on the
assumption that needed data will be available in cache memory.

However, in some circumstances instructions are not successfully speculatively
executed. In these situations, the speculatively executed instructions must be re-executed.

BRIEF DESCRIPTION OF THE DRAWINGS

Figure 1 is a diagram illustrating a portion of a processor according to an
embodiment of the present invention.

Figure 2 is a block diagram illustrating a processor including an embodiment of
the present invention.

Figure 3 is a flow chart illustrating a method of instruction processing according
to an embodiment of the present invention.

DETAILED DESCRIPTION

A. Introduction

According to an embodiment of the present invention, a processor is provided that
speculatively schedules instructions for execution and allows for replay of unsuccessfully
executed instructions. Speculative scheduling allows the scheduling latency for instructions
to be reduced. The replay system re-executes instructions that were not successfully
executed when they were originally dispatched to an execution unit. An instruction is
considered not successfully executed when the instruction is executed with bad input data,
or when its output is bad due to a cache miss, etc. For example, a memory
load instruction may not execute properly if there is a cache miss during execution, thereby
requiring the instruction to be re-executed. In addition, all instructions dependent thereon
must also be replayed.

A challenging aspect of such a replay system is the possibility for long latency
instructions to circulate through the replay system and re-execute many times before
executing successfully. With long latency instructions, in certain circumstances, several
conditions occur which must be resolved serially. When this occurs, each condition results
in extra replays for all dependent instructions. This results in several instructions incurring
several sequential cache misses. This condition occurs when a chain of dependent
instructions is replaying. In the chain, several instructions may each take a cache miss.
This results in a cascading set of replays. Each instruction must replay an extra time for
each miss in the dependency path. For example, a long latency instruction and instructions
dependent on the long latency instruction may be re-executed multiple times until a cache hit
occurs.
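
By way of illustration only, the cascading effect can be sketched in Python under a deliberately simplified model: a single linear chain of dependent instructions in which each instruction incurs one extra replay for every cache miss at or before its position in the chain. The function name and the model itself are assumptions made for this sketch.

```python
def extra_replays(chain_misses):
    """chain_misses[i] is True if instruction i of a linear dependency chain
    takes a cache miss. Returns, for each instruction, the number of misses
    at or before it in the dependency path."""
    counts = []
    misses_so_far = 0
    for missed in chain_misses:
        if missed:
            misses_so_far += 1
        counts.append(misses_so_far)
    return counts

# Hypothetical chain of four instructions in which the first and third miss.
replays = extra_replays([True, False, True, False])
```

In this model the last instruction in the chain pays for every miss that precedes it, which is the serial resolution cost described above.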

An example of a long latency instruction is a memory load instruction in which there
is an L0 cache miss and an L1 cache miss (i.e., on-chip cache misses) on a first attempt at
executing an instruction. As a result, the execution unit may then retrieve the data from an
external memory device across an external bus. This retrieval may be very time
consuming, requiring several hundred clock cycles. Any unnecessary and repeated
re-execution of this long latency load instruction before its source data has become available
wastes valuable execution resources, prevents other instructions from executing, and
increases overall processor latency.

Therefore, according to an embodiment, a replay queue is provided for storing the
instructions. After unsuccessful execution of an instruction, the instruction, and
instructions dependent thereon that have also unsuccessfully executed, are stored in the
replay queue until the data the instruction requires returns, that is, the cache memory
location for the data is valid. At this time, the instruction is considered ready for execution.
For example, when the data for a memory load instruction returns from external memory,
the memory load instruction may then be scheduled for execution, and any dependent
instructions may then be scheduled after execution of the memory load instruction has
completed.
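
By way of illustration only, the hold-and-release behavior may be sketched in Python as follows. The class and method names are hypothetical, and the sketch assumes that instruction sequence numbers increase in program order.

```python
class HoldingReplayQueue:
    """Holds a failed load and its dependents until the load's data returns."""

    def __init__(self):
        # Maps the sequence number of a failed load to the held instructions.
        self.waiting = {}

    def hold(self, load_seq, dependent_seqs):
        # Keep the load first, then its dependents in program order.
        self.waiting[load_seq] = [load_seq] + sorted(dependent_seqs)

    def data_returned(self, load_seq):
        # Release the load and its dependents for scheduling; empty if unknown.
        return self.waiting.pop(load_seq, [])

queue = HoldingReplayQueue()
queue.hold(1, [3, 2])
released = queue.data_returned(1)
```

While the instructions are held, they consume no execution slots, which is the benefit the replay queue is intended to provide.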

B. System Architecture 

Figure 1 is a block diagram illustrating a portion of a processor according to an
embodiment of the present invention. Allocator/renamer 10 receives instructions from a
front end (not shown). After allocating system resources for the instructions, the
instructions are passed to replay queue 20. Replay queue 20 stores instructions in program
order. Replay queue 20 may mark instructions as safe or unsafe based on whether the data
required by the instruction is available. Instructions are then sent to scheduler 30. Although
only one scheduler 30 is depicted so as to simplify the description of the invention, multiple
schedulers may be coupled to replay queue 20. Scheduler 30 may re-order the instructions
to execute the instructions out of program order to achieve efficiencies inherent with data
speculation. To avoid excessive replay looping, which decreases processor throughput,
increases energy consumption and increases heat emitted, scheduler 30 may include counter
32. Counter 32 may be used to maintain a counter for each instruction representing the
number of times the instruction has been executed or replayed. In one embodiment, a
counter may be included with replay queue 20. In another embodiment, a counter may be
paired with and travel with the instruction. In yet another embodiment, an on-chip memory
device may be dedicated to keep track of the replay status of all pending instructions. In
this embodiment, the on-chip counter may be accessed by replay queue 20 and/or scheduler
30 and/or checker 60.

In determining the order in which instructions will be sent to be executed, scheduler
30 uses a combination of the data dependencies of the various instructions and the
anticipated execution time of the instructions, that is, the latency of the instructions.
Scheduler 30 may refer to counter 32 to determine whether a maximum
number of executions or replays has already occurred. In this way, excessive replays of
instructions that are unsafe for replaying can be avoided, thus freeing up system resources
for the execution of other instructions. Scheduler 30 passes instructions to execution unit
40. Although only one execution unit 40 is depicted so as to simplify the description of the
invention, multiple execution units may be coupled to multiple schedulers. In one
embodiment, the scheduler may stagger instructions among a plurality of execution units
such that execution of the instructions is performed in parallel, and out of order, in an
attempt to match the data dependencies of instructions with the expected completion of other
instructions, executing in parallel, on which the instruction is dependent. In one embodiment,
a staging queue 50 may also be used to send instruction information in parallel with
execution unit 40 in an unexecuted, delayed form. Although a staging queue is depicted,
some form of staging is required and any other staging method may be used.
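
By way of illustration only, a per-instruction replay counter of the kind described for counter 32 may be sketched in Python as follows. The class name, the method names and the threshold value are hypothetical, and the action taken once the maximum is reached is not specified by this sketch.

```python
from collections import defaultdict

class ReplayCounter:
    """Counts executions or replays per instruction sequence number."""

    def __init__(self, max_replays):
        self.max_replays = max_replays
        self.counts = defaultdict(int)

    def record_dispatch(self, seq_num):
        # Called each time the instruction is sent to an execution unit.
        self.counts[seq_num] += 1

    def may_dispatch(self, seq_num):
        # False once the maximum number of executions has already occurred.
        return self.counts[seq_num] < self.max_replays

counter = ReplayCounter(max_replays=2)
counter.record_dispatch(7)
counter.record_dispatch(7)
```

A scheduler consulting `may_dispatch` before each dispatch would stop re-issuing instruction 7 here while remaining free to issue other instructions.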

After execution of the instruction, checker 60 determines whether execution was
successful. In one embodiment, this may be achieved by analyzing the data dependency of
the instruction, and whether a cache hit or cache miss occurred at the cache location of the
needed data. More specifically, the checker checks whether replay is necessary by
checking the input registers of the instruction to determine whether the data contained
therein was valid. To determine whether replay is necessary, the checker may check the
condition of the result of the execution to determine whether a cache hit occurred. Other
conditions may also generate replays. For example, two instructions needing the same
hardware resource at the same time may cause one of the instructions to replay, as two
instructions may not access the resource simultaneously.

If the instruction executed successfully, the instruction and its result are sent to
retire unit 70 to be retired. Retire unit 70 retires the instruction. Retire unit 70 may be
coupled to and communicate with allocator/renamer 10 to de-allocate the resources used by
the instruction. Retire unit 70 may also be coupled to and communicate with replay queue
20 to signal that the instruction has been successfully executed and should be removed from
the replay queue. If the instruction did not execute successfully, checker 60 sends the
instruction back to replay queue 20, and the execution cycle, also referred to herein as the
replay loop, begins again for the instruction. In one embodiment, replay queue 20 returns
instructions to scheduler 30 in program order. In one embodiment, replay queue 20 may be
thought of as a rescheduled replay queue, as the instructions are passed back to scheduler
30 for rescheduling and re-execution. In one embodiment, replay queue 20 maintains the
instructions in program order.

To determine when instructions should be passed to the scheduler, in one
embodiment, replay queue 20 may maintain a set of bits for each instruction that is not
retired by checker 60. In this embodiment, replay queue 20 may maintain a replay safe bit,
a valid bit, an in-flight bit and a ready bit. The replay queue sets the replay safe bit to one
(1) or true when the instruction passes the checker, such that the instruction has completed
execution successfully. The replay queue sets the valid bit to one (1) or true when the
instruction has been loaded and is in the replay queue. The replay queue sets the in-flight
bit to one (1) or true when the instruction is in an execution unit, that is, when the
instruction is being executed. The replay queue sets the ready bit to one (1) or true when
the inputs or sources needed for the instruction to execute are known to be ready. The
ready bit may also be referred to as a source valid bit because it is set to one (1) or true
when the sources for the instruction are valid.
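
By way of illustration only, the four status bits may be sketched in Python as a small record kept per instruction. The class, field and method names are hypothetical. The sketch records only when each bit is set, as stated above, and does not model how the bits combine to trigger dispatch.

```python
from dataclasses import dataclass

@dataclass
class ReplayQueueEntry:
    seq_num: int
    valid: bool = False        # instruction is loaded in the replay queue
    in_flight: bool = False    # instruction is in an execution unit
    replay_safe: bool = False  # instruction passed the checker
    ready: bool = False        # sources are known to be ready (source valid)

    def on_loaded(self):
        self.valid = True

    def on_sent_to_execution_unit(self):
        self.in_flight = True

    def on_passed_checker(self):
        self.replay_safe = True

    def on_sources_ready(self):
        self.ready = True

entry = ReplayQueueEntry(seq_num=5)
entry.on_loaded()
entry.on_sources_ready()
```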

Figure 2 is a block diagram illustrating a processor including an embodiment of
the present invention. Processor 2 includes front end 4 which may include several units,
such as an instruction fetch unit, an instruction decoder for decoding instructions, that is,
for decoding complex instructions into one or more micro-operations or μops, and an
instruction queue (IQ) for temporarily storing instructions. In one embodiment, the
instructions stored in the instruction queue may be μops. In other embodiments, other
types of instructions may be used. The instructions provided by the front end may
originate as assembly language or machine language instructions which may, in some
embodiments, be decoded from macro-operations into μops. These μops may be thought
of as a machine language of the micro-architecture of a processor. It is these μops that, in
one embodiment, are passed by the front end. In another embodiment, particularly in a
reduced instruction set computer (RISC) processor chip, no decoding will be required as
there is a one-to-one correspondence between the RISC assembly language and the μops of
the processor. As set forth herein, the term instructions refers to any macro or micro
operations, μops, assembly language instructions, machine language instructions, and the
like.

In one embodiment, allocator/renamer 10 may receive instructions in the form of
μops from front end 4. In one embodiment, each instruction may include the instruction
and up to two logical sources and one logical destination. The sources and destination are
logical registers (not shown) within processor 2. In one embodiment, a register alias table
(RAT) may be used to map logical registers to physical registers for the sources and the
destination. Physical sources are the actual internal memory addresses of memory on the
chip dedicated to serve as registers. Logical registers are the registers defined by the
processor architecture that may be recognized by persons writing assembly language code.
For example, according to the Intel Architecture known as IA-32, logical registers include
EAX and EBX. Such logical registers may be mapped to physical registers, such as, for
example, 1011 and 1001.
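
By way of illustration only, a register alias table may be sketched in Python as a mapping from logical register names to physical register numbers. The class name, the method names and the free-list allocation policy are assumptions made for this sketch; the physical register numbers follow the example values given above.

```python
class RegisterAliasTable:
    """Maps logical registers to physical registers (illustrative only)."""

    def __init__(self, free_physical):
        self.mapping = {}
        self.free = list(free_physical)  # assumed free list of physical registers

    def rename_destination(self, logical):
        # Allocate the next free physical register for a logical destination.
        physical = self.free.pop(0)
        self.mapping[logical] = physical
        return physical

    def lookup_source(self, logical):
        # Return the physical register currently holding a logical source.
        return self.mapping[logical]

rat = RegisterAliasTable([0b1011, 0b1001])
eax_physical = rat.rename_destination("EAX")
ebx_physical = rat.rename_destination("EBX")
```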

Allocator/renamer 10 is coupled to replay queue 20. Replay queue 20 is coupled to
scheduler 30. Although only one scheduler is depicted so as to simplify the description of
the invention, multiple schedulers may be coupled to the replay queue. Scheduler 30
dispatches instructions received from the replay queue 20 to be executed. Instructions may
be dispatched when the resources, namely physical registers, are marked valid to execute
the instructions, and when instructions are determined to be good candidates to execute
speculatively. That is, scheduler 30 may dispatch an instruction without first determining
whether data needed by the instruction is valid or available. More specifically, the
scheduler dispatches speculatively based on the assumption that needed data is available in
the cache memory. That is, the scheduler dispatches instructions based on latencies,
assuming that the cache location holding needed input to an instruction will result in a cache
hit when the instruction requests needed data from the cache memory during execution.
Scheduler 30 outputs instructions to execution unit 40. Although only one execution unit is
depicted so as to simplify the description of the invention, multiple execution units may be
coupled to multiple schedulers. Execution unit 40 executes received instructions.
Execution unit 40 may comprise an arithmetic logic unit (ALU), a floating point
unit (FPU), a memory unit for performing memory loads (memory data reads) and stores
(memory data writes), etc.

Execution unit 40 may be coupled to multiple levels of memory devices from which
data may be retrieved and to which data may be stored. In one embodiment, execution unit
40 is coupled to L0 cache system 44, L1 cache system 46, and external memory
devices via memory request controller 42. As described herein, the term cache system
includes all cache related components, including cache memory and hit/miss logic that
determines whether requested data is found in the cache memory. L0 cache system 44 is
the fastest memory device and may be located on the same semiconductor die as execution
unit 40. As such, data can be retrieved from and written to L0 cache very quickly. In one
embodiment, L0 cache system 44 and L1 cache system 46 are located on the die of
processor 2, while L2 cache system 84 is located off the die of processor 2. In another
embodiment, an L2 cache system may be included on the die adjacent to the L1 cache
system and coupled to the execution unit via a memory request controller. In such an
embodiment, an L3 cache system may be located off the die of the processor.

If data requested by execution unit 40 is not found in L0 cache system 44, execution
unit 40 may attempt to retrieve needed data from additional levels of memory devices. Such
requests may be made through memory request controller 42. After L0 cache system 44 is
checked, the next level of memory devices is L1 cache system 46. If the data needed is not
found in L1 cache system 46, execution unit 40 may be forced to retrieve the needed data
from the next level of memory devices, which, in one embodiment, may be external
memory devices coupled to processor 2 via external bus 82. An external bus interface 48
may be coupled to memory request controller 42 and external bus 82. In one embodiment,
external memory devices may include some, all of, and/or multiple instances of L2 cache
system 84, main memory 86, disk memory 88, and other storage devices, all of which may
be coupled to external bus 82. In one embodiment, main memory 86 comprises dynamic
random access memory (DRAM). Disk memory 88, main memory 86 and L2 cache system
84 are considered external memory devices because they are external to the processor and
are coupled to the processor via an external bus. Access to main memory 86 and disk
memory 88 is substantially slower than access to L2 cache system 84. Access to all
external memory devices is much slower than access to the on-die cache memory systems.

In one embodiment, a computer system may include a first external bus dedicated to
an L2 cache system and a second external bus used by all other external memory devices.
In various embodiments, processor 2 may include one, two, three or more levels of on-die
cache memory systems.

When attempting to load data to a register from memory, execution unit 40 may
attempt to load the data from each of the memory devices from fastest to slowest. In one
embodiment, the fastest level of memory devices, L0 cache system 44, is checked first,
followed by L1 cache system 46, L2 cache system 84, main memory 86, and disk memory
88. The time to load memory increases as each additional memory level is accessed. When
the needed data is eventually found, the data retrieved by execution unit 40 is stored in the
fastest available memory device to allow for future access. In one embodiment, this may be
L0 cache system 44.
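
By way of illustration only, the fastest-to-slowest search may be sketched in Python as follows, with each memory level modeled as a dictionary. The function name and the level names are hypothetical, and capacity limits and eviction are deliberately omitted.

```python
def load(address, levels):
    """Search memory levels from fastest to slowest.

    levels is a list of (name, store) pairs ordered fastest first, where each
    store is a dict from address to value. Returns (value, level_name)."""
    for name, store in levels:
        if address in store:
            value = store[address]
            # Store the data in the fastest level to allow for future access.
            levels[0][1][address] = value
            return value, name
    raise KeyError(address)

l0, l1, main_memory = {}, {}, {0x10: 42}
levels = [("L0", l0), ("L1", l1), ("main memory", main_memory)]
first = load(0x10, levels)
second = load(0x10, levels)
```

The first access is served by the slowest level that holds the data, while the second access to the same address is served by the fastest level.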

Processor 2 further includes a replay mechanism implemented via checker 60 and
replay queue 20. Checker 60 is coupled to receive input from execution unit 40 and is
coupled to provide output to replay queue 20. This replay mechanism provides that
instructions that were not executed successfully may be re-executed or replayed. In one
embodiment, staging queue 50 may be coupled between scheduler 30 and checker 60, in
parallel with execution unit 40. In this embodiment, staging queue 50 may delay
instructions for a fixed number of clock cycles so that the result from the execution unit
and the corresponding instruction in the staging queue may enter the checker at the same
time. In various embodiments, the number of stages in staging queue 50 may vary based
on the amount of staging or delay desired in each execution channel. A copy of each
dispatched instruction may be staged through staging queue 50 in parallel to being executed
through execution unit 40. In this manner, a copy of the instruction maintained in staging
queue 50 is provided to checker 60. This copy of the instruction may then be routed back
to replay queue 20 by checker 60 for re-execution if the instruction did not execute
successfully.
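
By way of illustration only, a fixed-delay staging queue may be sketched in Python as a shift register advanced once per clock cycle. The class and method names are hypothetical, and the sketch assumes at least one stage.

```python
from collections import deque

class StagingQueue:
    """Delays each instruction copy by a fixed number of clock cycles."""

    def __init__(self, stages):
        # One slot per stage; None represents an empty slot (a bubble).
        self.slots = deque([None] * stages, maxlen=stages)

    def clock(self, instruction=None):
        # The oldest slot leaves toward the checker; the new copy enters.
        leaving = self.slots[0]
        self.slots.append(instruction)  # maxlen discards the oldest slot
        return leaving

staging = StagingQueue(stages=3)
outputs = [staging.clock("A"), staging.clock(), staging.clock(), staging.clock()]
```

With three stages, the copy of instruction "A" emerges three cycles after it entered, which would match an execution unit with a three-cycle latency.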

Checker 60 receives instructions output from staging queue 50 and execution unit
40, and determines which instructions have executed successfully and which have not. If
an instruction has executed successfully, checker 60 marks the instruction as completed.
Completed instructions are forwarded to retire unit 70, which is coupled to checker 60.
Retire unit 70 restores the instructions to original program order
and retires them. In addition, retire unit 70 is coupled to allocator/renamer 10.
When an instruction is retired, retire unit 70 instructs allocator/renamer 10 to de-allocate the
resources that were used by the retired instruction. In addition, retire unit 70 may be
coupled to and communicate with replay queue 20 so that upon retirement, a signal is sent
from retire unit 70 to replay queue 20 such that all data maintained by replay queue 20 for
the instruction are de-allocated.

Execution of an instruction may be considered unsuccessful for multiple reasons.
The most common reasons are an unfulfilled source dependency and an external replay
condition. An unfulfilled source dependency can occur when a source of a current
instruction is dependent on the result of another instruction which has not yet completed
successfully. This data dependency may cause the current instruction to execute
unsuccessfully if the correct data for the source is not available at execution time, that is,
the result of a predecessor instruction is not available as source data at execution time,
resulting in a cache miss.

In one embodiment, checker 60 may maintain a table known as scoreboard 62 to
track the readiness of sources such as registers. Scoreboard 62 may be used by checker 60
to keep track of whether the source data was valid or correct prior to execution of an
instruction. After an instruction has been executed, checker 60 may use scoreboard 62 to
determine whether data sources for the instruction were valid. If the sources were not valid
at execution time, this may indicate to checker 60 that the instruction did not execute
successfully due to an unfulfilled data dependency, and the instruction should therefore be
replayed.
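
By way of illustration only, a scoreboard check may be sketched in Python as a table of per-register readiness flags. The class, method and register names are hypothetical, and the sketch simplifies the timing by consulting the table as it stood at execution time.

```python
class Scoreboard:
    """Tracks whether each source register held valid data."""

    def __init__(self):
        self.ready = {}

    def mark_ready(self, register):
        self.ready[register] = True

    def mark_not_ready(self, register):
        self.ready[register] = False

    def sources_valid(self, sources):
        # Registers never recorded are treated as not ready.
        return all(self.ready.get(register, False) for register in sources)

def needs_replay(scoreboard, sources):
    # An unfulfilled data dependency means the instruction should be replayed.
    return not scoreboard.sources_valid(sources)

board = Scoreboard()
board.mark_ready("register-Y")
board.mark_not_ready("register-X")
replay_add = needs_replay(board, ["register-X", "register-Y"])
```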

External replay conditions may include a cache miss (e.g., source data was not
found in the L0 cache system at time of execution), incorrect forwarding of data (e.g., from
a store buffer to a load), hidden memory dependencies, a write back conflict, an unknown
data address, serializing instructions, etc. In one embodiment, L0 cache system 44 and L1
cache system 46 may be coupled to checker 60. In this embodiment, L0 cache system 44
may generate an L0 cache miss signal to checker 60 when there is a cache miss at the L0
cache system. Such a cache miss indicates that the source data for the instruction was not
found in L0 cache system 44. In another embodiment, similar information and/or signals
may be similarly generated to checker 60 to indicate the occurrence of a cache miss at L1
cache system 46 and external replay conditions, such as from any external memory devices,
including L2 cache, main memory, disk memory, etc. In this way, checker 60 may
determine whether each instruction has executed successfully.

If checker 60 determines that the instruction has not executed successfully, checker
60 signals replay queue 20 that the instruction must be replayed, that is, re-executed. More
specifically, checker 60 sends a replay needed signal to replay queue 20 that also identifies
the instruction by an instruction sequence number. Instructions that do not execute
successfully may be signaled to replay queue 20 such that all instructions, regardless of the
type of instruction or the specific circumstances under which the instruction failed to
execute successfully, will be signaled to replay queue 20 and replayed. Such unconditional
replaying works well for instructions with short latencies which require only one or a small
number of passes or replay iterations between checker 60 and replay queue 20.

As instructions may be speculatively scheduled for execution (i.e., before actually
waiting for the correct source data to be available) on the expectation that the source data
will be available, if it turns out that the source data was not available at the time of execution
but that the sources are now valid, checker 60 determines that the instruction is safe for
replay and signals replay queue 20 that the instruction needs to be replayed. That is,
because the source data was not available or correct at the time of execution, but the
data sources are now available and ready, checker 60 determines that the instruction is safe
for replay and signals replay queue 20 that the instruction is replay safe. In one
embodiment, the replay queue may mark a replay safe bit as true and clear an in-flight bit
for the instruction. In this situation, in one embodiment, checker 60 may signal replay
queue 20 with a replay safe bit paired with an instruction identifier. In other embodiments,
the signals may be replaced with the checker sending actual instructions and accompanying
information to the replay queue. If, at the time execution of the instruction is completed, the
sources were not and are still not available, the replay safe bit and the in-flight bit may be
cleared, that is, may be set to false.
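
By way of illustration only, the bit updates following an unsuccessful execution may be sketched in Python as follows. The function name and the dictionary keys are hypothetical, and the entry is modeled as a plain dictionary for brevity.

```python
def update_bits_after_failed_execution(entry, sources_valid_now):
    """Update a replay queue entry after the checker reports a failed execution.

    If the sources have since become valid, the instruction is replay safe;
    otherwise the replay safe bit is cleared. The in-flight bit is cleared in
    both cases because the instruction has left the execution unit."""
    entry["in_flight"] = False
    entry["replay_safe"] = bool(sources_valid_now)
    return entry

safe_entry = update_bits_after_failed_execution(
    {"seq_num": 4, "in_flight": True, "replay_safe": False}, sources_valid_now=True)
held_entry = update_bits_after_failed_execution(
    {"seq_num": 9, "in_flight": True, "replay_safe": True}, sources_valid_now=False)
```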

Some long latency instructions may require many iterations through the replay loop
before finally executing successfully. If the instruction did not execute successfully on the
first attempt, checker 60 may determine whether the instruction requires a relatively long
period of time to execute (i.e., a long latency instruction), requiring several replays before
executing properly. There are many examples of long latency instructions. One example is
a divide instruction which may require many clock cycles to execute.

Another example of a long latency instruction is a memory load or store instruction
involving multiple levels of cache system misses, such as an L0 cache system miss and an
L1 cache system miss. In such cases, an external bus request may be required to retrieve
the data for the instruction. If access across an external bus is required to retrieve the
desired data, the access delay is substantially increased. To retrieve data from an external
memory, a memory request controller may be required to arbitrate for ownership of an
external bus, issue a bus transaction (memory read) to the external bus, and then await
return of the data from one of the external memory devices. Many more clock cycles may
be required to retrieve data from a memory device on an external bus than
the time needed to retrieve data from on-chip cache systems such as, for
example, L0 cache system or L1 cache system. Thus, due to the need to retrieve data from
an external memory device across an external bus, load instructions involving cache misses
of on-chip cache systems may be considered to be long latency instructions.

042390.P9576    Express Mail No.: EL634500996US

During this relatively long period of time while the long latency instruction is being
processed, it is possible that an instruction may circulate an inordinate number of times,
anywhere from tens to hundreds of iterations, from replay queue 20 to scheduler 30 to
execution unit 40 to checker 60 and back again. Each time a long latency
instruction is replayed before the source data has returned, the instruction
unnecessarily occupies a slot in the execution pipeline and uses execution resources which
could have been allocated to other instructions which are ready to execute and may execute
successfully. Moreover, there may be many additional instructions which are dependent
upon the result of this long latency instruction which will similarly repeatedly circulate
without properly executing. These dependent instructions will not execute properly until
after the data for the long latency instruction returns from the external memory device,
occupying and wasting still more execution resources. The unnecessary and
excessive iterations which may occur before the return of needed data may waste execution
resources, may waste power, and may increase overall latency. In addition, such iterations
may cause a backup of instructions and greatly reduce processor performance in the form of
reduced throughput.

For example, where several calculations are being performed for displaying pixels
on a display, an instruction for one of the pixels may be a long latency instruction, e.g.,
requiring a memory access to an external memory device. There may be many
non-dependent instructions for other pixels behind this long latency instruction that do not
require an external memory access. As a result, by continuously replaying the long latency
instruction and its dependent instructions, non-dependent instructions for other pixels may
be precluded from execution. Once the long latency instruction has properly executed,
execution slots and resources become available, and the instructions for the other pixels
may then be executed. To prevent this condition, long latency instructions and instructions
dependent thereon are kept in the replay queue until the sources for the long latency
instruction are available. The instructions are then released. This is achieved by storing
long latency instructions and instructions dependent thereon in replay queue 20. When data
for a long latency instruction becomes available, such as when returning from an external
memory device, the long latency instruction and its dependent instructions may then be sent
by replay queue 20 to scheduler 30 for replay. In one embodiment, this may be
accomplished by setting the valid bit to true and clearing, or setting to false or zero, the
other bits, such as the in-flight bit, the replay safe bit and the ready bit, for each of the long
latency instruction and instructions dependent thereon. In this embodiment, when the
replay queue learns that the sources for the long latency instruction are available, the ready
bit for the long latency instruction may be set to true, thus causing the replay queue to
release the long latency instruction and instructions dependent thereon to the scheduler. In
this manner, long latency instructions will not unnecessarily delay execution of other
non-dependent instructions. Performance is improved, throughput is increased, and power
consumption is decreased when the non-dependent instructions execute in parallel while
the long latency instruction awaits return of its data.

In one embodiment, a long latency instruction may be identified and loaded into
replay queue 20, and one or more additional instructions, that is, instructions which may be
dependent upon the long latency instruction, may also be loaded into replay queue 20.
When the condition causing the long latency instruction to not complete successfully is
cleared, such as, for example, when data returns from an external bus after a cache miss or
after completion of a division or multiplication operation or completion of another long
latency instruction, replay queue 20 then transfers the instructions to scheduler 30 so that
the long latency instruction and the additional instructions may then be replayed, that is,
re-executed.
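
By way of illustration only, the hold-and-release behavior of the replay queue may be modeled in software. The following Python sketch is not the claimed hardware. The names `Entry`, `ReplayQueue`, `park`, `data_returned` and `release`, and the `waits_on` tag used to associate entries with an outstanding long latency operation, are assumptions introduced for the example. Only the valid, in-flight, replay safe and ready bits are taken from the description, and marking every waiting entry ready at once is a simplification.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    """One replay queue slot (hypothetical software model)."""
    name: str
    waits_on: str              # hypothetical tag of the outstanding operation
    valid: bool = False
    in_flight: bool = False
    replay_safe: bool = False
    ready: bool = False


class ReplayQueue:
    def __init__(self):
        self.entries = []

    def park(self, name, waits_on):
        # Hold the instruction: valid is set, all other bits stay cleared.
        self.entries.append(Entry(name, waits_on, valid=True))

    def data_returned(self, tag):
        # Sources are now available: mark each entry waiting on this tag ready.
        for e in self.entries:
            if e.valid and e.waits_on == tag:
                e.ready = True

    def release(self):
        # Send ready entries that are not already in flight to the scheduler.
        out = [e for e in self.entries
               if e.valid and e.ready and not e.in_flight]
        for e in out:
            e.in_flight = True
        return [e.name for e in out]


rq = ReplayQueue()
rq.park("load", waits_on="miss0")    # long latency instruction
rq.park("add", waits_on="miss0")     # instruction dependent on the load
held = rq.release()                  # nothing leaves while data is outstanding
rq.data_returned("miss0")
released = rq.release()              # load and its dependent leave together
```

In this model the parked instructions consume no execution slots between `park` and `data_returned`, which is the effect the replay queue is intended to achieve.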

However, it may be difficult to identify dependent instructions because there can be
hidden memory dependencies, etc. Therefore, in one embodiment, when a long latency
instruction is identified and loaded into replay queue 20, all additional instructions which do
not execute properly and are programmatically younger may be loaded into the replay queue
as well. That is, in one embodiment, younger instructions may have sequence numbers
greater than those of older instructions that were provided by the front end earlier in time.
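
As a minimal sketch of this selection, assuming sequence numbers are plain integers that never wrap around (an assumption made only for the example), the younger instructions that failed to execute may be picked out by comparison against the sequence number of the long latency instruction. The function name `select_younger_failures` is hypothetical.

```python
def select_younger_failures(long_latency_seq, failed_seqs):
    """Return, in program order, the failed instructions that are younger
    than the long latency instruction (larger sequence number)."""
    return sorted(s for s in failed_seqs if s > long_latency_seq)


# The long latency instruction has sequence number 10; instruction 7 is older
# and is therefore not loaded into the replay queue with it.
parked = select_younger_failures(10, [15, 7, 11, 12])
```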

To avoid unnecessary replay of instructions, a counter may also be used. In one
embodiment, a counter may be combined with the instruction and related information as it is
passed from replay queue to scheduler to execution unit to checker within the processor. In
another embodiment, a counter may be included with the replay queue. In yet another
embodiment, an on-chip memory device may be dedicated to keep track of the replay status
of all pending instructions and may be accessible by the scheduler, the replay queue, and/or
the checker. In any of these embodiments, the counter may be used to maintain the
number of times an instruction has been executed or replayed. In one embodiment, when
the scheduler receives an instruction, the counter for the instruction may automatically be
incremented. The counter is used to break replay dependency loops and to alleviate
unnecessary execution of instructions that cannot yet be successfully executed. In one
embodiment, the scheduler checks whether the counter for the instruction has exceeded a
machine specified maximum number of replays. If the counter exceeds the maximum, the
instruction is not scheduled to be executed until the data required by the instruction is
available. When the data is available, the instruction is deemed safe for execution.
According to this method, no instruction can loop through the processor more than the
machine specified maximum number of iterations. In this way the replay loop is broken
when the machine specified maximum number of iterations has been exceeded.
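
The effect of the counter may be illustrated with a small simulation. This is a sketch under stated assumptions, not a description of the actual circuit: time is measured in abstract trips around the replay loop, the counter is incremented each time the scheduler receives the instruction, the data becomes available at trip `data_ready_at`, and the name `count_attempts` and the value of `MAX_REPLAYS` are hypothetical.

```python
MAX_REPLAYS = 3  # hypothetical machine specified maximum


def count_attempts(data_ready_at, max_replays):
    """Count trips through the execution unit for a single instruction."""
    counter = 0    # replay counter kept for the instruction
    attempts = 0   # number of times the instruction is actually executed
    t = 0          # abstract time, in trips around the replay loop
    while True:
        counter += 1                    # scheduler receives the instruction
        if counter > max_replays:
            t = max(t, data_ready_at)   # held until its data is available
        attempts += 1                   # dispatched to the execution unit
        if t >= data_ready_at:
            return attempts             # checker reports success
        t += 1                          # checker reports failure; replay


gated = count_attempts(100, MAX_REPLAYS)   # counter limits speculation
ungated = count_attempts(100, 10**9)       # limit so large it never applies
```

With these assumed numbers the gated instruction is executed 4 times (three failed speculative attempts and one successful execution), whereas the effectively unlimited case is executed 101 times before its data arrives.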

C. A Method of Instruction Processing

Figure 3 is a flow chart illustrating a method of instruction processing according
to an embodiment of the present invention. A plurality of instructions are received, as
shown in block 110. System resources are then allocated for use with the execution of the
instructions, including renaming of resources, as shown in block 112. The instructions are
then placed in a queue, as shown in block 114. Execution of the instructions is then
scheduled based on data dependencies of the instructions and expected latencies of the
instructions, as shown in block 116. A check is then made to determine whether a counter
for the instruction is set to zero, signifying that this will be the first time that the instruction
is executed, as shown in block 118. If the counter is set to zero, the instruction is then
executed, as shown in block 124. If the counter is not zero, a check is then made to
determine whether the counter for the instruction exceeds a system specified maximum, as
shown in block 120. That is, a check is made to determine whether the maximum number
of replay iterations has been exceeded. If the counter does not exceed the system specified
maximum, the instruction is executed, as shown in block 124. If the counter exceeds the
system specified maximum, a check is made to determine whether the data sources for the
instruction are ready and available, as shown in block 122. That is, a check is made to
determine whether the instruction is safe to be executed. If the instruction is not safe to be
executed and the maximum number of replays has been exceeded, the replay loop is
broken, and flow continues at block 116, as shown in blocks 120 and 122. That is,
whenever the instruction is not safe for execution and the maximum number of replays or
executions is exceeded, further replays are prevented until the data for the instruction is
available.

If the instruction is being executed for the first time, if the instruction is safe to be
executed, or if the maximum number of replays of the instruction has not been exceeded, an
attempt is made at executing the instruction, as shown in block 124. In the situations where
the instruction is being executed for the first time and where the counter has not exceeded
the maximum, the instruction is executed whether or not the data for it is available. In this way,
the execution may be made speculatively. A check is then made to determine if the
execution of the instruction was successful, as shown in block 126. If execution of the
instruction was not successful, the method continues at block 128 where the counter for the
instruction is incremented. A signal is then sent to the queue signifying that
execution was unsuccessful and the instruction should be rescheduled for re-execution. If
execution of the instruction is successful, as shown in block 126, the instruction is retired,
including placing the instruction in program order, de-allocating system resources used by
the instruction, and removing the instruction and related information from the queue, as
shown in block 130.
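
The decision flow of blocks 116 through 130 may be sketched for a single instruction as follows. This is an illustrative model only: the function names `schedule_step` and `run`, the string labels, and the convention that the data becomes available at abstract time step `data_ready_at` are assumptions made for the example, and success is modeled simply as executing once the data is available.

```python
def schedule_step(counter, data_ready, max_replays):
    """Decide whether to execute or hold, following blocks 118, 120 and 122."""
    if counter == 0:                 # block 118: first execution
        return "execute"
    if counter > max_replays:        # block 120: maximum exceeded
        if data_ready:               # block 122: safe to execute
            return "execute"
        return "hold"                # loop broken; back to block 116
    return "execute"                 # speculative replay


def run(data_ready_at, max_replays):
    """Follow one instruction until it retires; return (counter, trace)."""
    counter, trace, t = 0, [], 0
    while True:
        data_ready = t >= data_ready_at
        action = schedule_step(counter, data_ready, max_replays)
        trace.append(action)
        if action == "execute":      # block 124
            if data_ready:           # block 126: execution succeeded
                trace.append("retire")   # block 130
                return counter, trace
            counter += 1             # block 128: count the failed attempt
        t += 1


final_counter, trace = run(data_ready_at=4, max_replays=2)
```

With the assumed values, the instruction is executed speculatively three times, is held for one step once its counter exceeds the maximum, and is then executed successfully and retired when its data becomes available.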

In another embodiment, the checks for data availability and exceeding the replay
counter may be reversed. In this embodiment, blocks 120 and 122 may be executed in
reverse order. In this embodiment, a check is made to learn whether the instruction is safe,
and, if it is, the instruction is executed. If the instruction is not safe, a check is made to
determine whether the maximum number of replays has been exceeded; if it has, the replay
loop is broken, and flow proceeds to block 116. If the maximum number of replays has
not been exceeded, a speculative execution proceeds.
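
The two orderings of the checks may be compared in a short sketch. The function names `in_order` and `reversed_order` and the value of `MAX_REPLAYS` are hypothetical, and the model reduces each check to a simple comparison, so it illustrates only the decision logic and not any timing or circuit difference between the embodiments.

```python
from itertools import product


def in_order(counter, data_ready, max_replays):
    # Block 120 first (replay limit), then block 122 (data availability).
    if counter > max_replays:
        return "execute" if data_ready else "hold"
    return "execute"              # speculative attempt


def reversed_order(counter, data_ready, max_replays):
    # Block 122 first (data availability), then block 120 (replay limit).
    if data_ready:
        return "execute"          # safe to execute
    if counter > max_replays:
        return "hold"             # replay loop is broken
    return "execute"              # speculative attempt


MAX_REPLAYS = 3  # hypothetical limit
same = all(
    in_order(c, ready, MAX_REPLAYS) == reversed_order(c, ready, MAX_REPLAYS)
    for c, ready in product(range(8), (False, True))
)
```

Under this simplified model both orderings reach the same execute or hold decision for every combination of counter value and data availability; they differ only in which check is evaluated first.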

In the foregoing specification, the invention has been described with reference to 
specific embodiments thereof. It will, however, be evident that various modifications and 
changes can be made thereto without departing from the broader spirit and scope of the 
invention as set forth in the appended claims. The specification and drawings are, 
accordingly, to be regarded in an illustrative rather than a restrictive sense.

CLAIMS

What is claimed is:



1. A processor comprising:
    a replay queue to receive a plurality of instructions;
    an execution unit to execute the plurality of instructions;
    a scheduler coupled between the replay queue and the execution unit to
speculatively schedule instructions for execution, to increment a counter for each of
the plurality of instructions to reflect the number of times each of the plurality of
instructions has been executed, and to dispatch each instruction of the plurality of
instructions to the execution unit either when the counter does not exceed a
maximum number of replays or, if the counter for the instruction exceeds the
maximum number of replays, when the instruction is safe to execute; and
    a checker coupled to the execution unit to determine whether each instruction
has executed successfully, and coupled to the replay queue to communicate to the replay
queue each instruction that has not executed successfully.

2. The processor of claim 1 further comprising:
    an allocator/renamer coupled to the replay queue to allocate and rename
those of a plurality of resources needed by the instruction.

3. The processor of claim 2 further comprising:
    a front end coupled to the allocator/renamer to provide the plurality of
instructions to the allocator/renamer.

4. The processor of claim 2 further comprising:
    a retire unit to retire the plurality of instructions, coupled to the checker to
receive those of the plurality of instructions that have executed successfully, and
coupled to the allocator/renamer to communicate a de-allocate signal to the
allocator/renamer.

5. The processor of claim 4 wherein the retire unit is further coupled to the
replay queue to communicate a retire signal when one of the plurality of instructions is
retired.






6. The processor of claim 1 further comprising:
    at least one cache system on a die of the processor;
    a plurality of external memory devices; and
    a memory request controller coupled to the execution unit to obtain data
from the at least one cache system and the plurality of external memory devices.

7. The processor of claim 6 wherein the at least one cache system comprises
a first level cache system and a second level cache system.

8. The processor of claim 6 wherein the external memory devices comprise at
least one of a third level cache system, a main memory, and a disk memory.

9. The processor of claim 1 further comprising:
    a staging queue coupled between the checker and the scheduler.

10. The processor of claim 1 wherein the scheduler comprises a plurality of
counters to maintain each of the plurality of counters for each of the plurality of
instructions.

11. The processor of claim 1 wherein the counter is one of a plurality of
counters such that each counter of the plurality of counters is paired with one of the
plurality of instructions.

12. The processor of claim 1 wherein the checker comprises a scoreboard to
maintain a status of a plurality of resources.

13. A processor comprising:
    a replay queue to receive a plurality of instructions;
    at least two execution units to execute the plurality of instructions;
    at least two schedulers coupled between the replay queue and the execution
units to schedule instructions for execution, to increment a counter for each of the
plurality of instructions to reflect the number of times each of the plurality of
instructions has been executed, and to communicate each instruction of the plurality
of instructions to the execution units when the counter does not exceed a maximum
number or, if the counter for the instruction exceeds the maximum number of
replays, when a data required by the instruction is available; and
    a checker coupled to the execution units to determine whether each
instruction has executed successfully, and coupled to the replay queue to communicate each
instruction that has not executed successfully.

14. The processor of claim 13 further comprising:
    a plurality of memory devices coupled to the execution units such that the
checker determines whether the instruction has executed successfully based on a plurality of
information provided by the memory devices.

15. The processor of claim 13 further comprising:
    an allocator/renamer coupled to the replay queue to allocate and rename
those of a plurality of resources needed by the plurality of instructions.

16. The processor of claim 15 further comprising:
    a front end coupled to the allocator/renamer to provide the plurality of
instructions to the allocator/renamer.

17. The processor of claim 15 further comprising:
    a retire unit to retire the plurality of instructions, coupled to the checker to
receive those of the plurality of instructions that have executed successfully, and
coupled to the allocator/renamer to communicate a de-allocate signal to the
allocator/renamer.

18. The processor of claim 17 wherein the retire unit is further coupled to the
replay queue to communicate a retire signal when one of the plurality of instructions is
retired.

19. A method comprising:
    receiving an instruction of a plurality of instructions;
    placing the instruction in a queue with other instructions of the plurality of
instructions;
    speculatively re-ordering those of the plurality of instructions in a scheduler
based on data dependencies and instruction latencies;
    dispatching one of the plurality of instructions to an execution unit to be
executed either when a counter for the instruction does not exceed a maximum
number of replays or, if the counter for the instruction exceeds the maximum
number of replays, when a required data for the instruction is available;
    executing the instruction;
    determining whether the instruction executed successfully;
    routing the instruction back to the queue if the instruction did not execute
successfully; and
    retiring the instruction if the instruction executed successfully.

20. The method of claim 19 further comprising:
    allocating those of a plurality of system resources needed by the instruction.

21. The method of claim 20 wherein retiring comprises:
    de-allocating those of the plurality of system resources used by the
instruction being retired; and
    removing the instruction and a plurality of related data from the queue.

22. The method of claim 19 further comprising:
    maintaining a plurality of counters, one each for each of the plurality of
instructions in the scheduler such that the counters reflect the number of times the
corresponding instruction has been executed.

23. The method of claim 22 wherein each of the plurality of counters for the
instruction is paired with each of the plurality of the instructions.

24. The method of claim 22 wherein the plurality of counters is stored in the
scheduler.

ABSTRACT

Breaking replay dependency loops in a processor using a rescheduled replay queue.
The processor comprises a replay queue to receive a plurality of instructions, and an
execution unit to execute the plurality of instructions. A scheduler is coupled between the
replay queue and the execution unit. The scheduler speculatively schedules instructions for
execution and increments a counter for each of the plurality of instructions to reflect the
number of times each of the plurality of instructions has been executed. The scheduler also
dispatches each instruction to the execution unit either when the counter does not exceed a
maximum number of replays or, if the counter exceeds the maximum number of replays,
when the instruction is safe to execute. A checker is coupled to the execution unit to
determine whether each instruction has executed successfully. The checker is also coupled
to the replay queue to communicate to the replay queue each instruction that has not
executed successfully.

and Justin M. Dillon, Reg. No. 42,486; my patent agents, with offices located at 12400 Wilshire Boulevard, 7th Floor, Los
Angeles, California 90025, telephone (310) 207-3800, and Alan K. Aldous, Reg. No. 31,905; Robert D. Anderson, Reg. No. 
33,826; Joseph R. Bond, Reg. No. 36,458; Richard C. Calderwood, Reg. No. 35,468; Jeffrey S. Draeger, Reg. No. 41,000; 
Cynthia Thomas Faatz, Reg. No. 39,973; Sean Fitzgerald, Reg. No. 32,027; Seth Z. Kalson, Reg. No. 40,670; David J. Kaplan,
Reg. No. 41,105; Charles A. Mirho, Reg. No. 41,199; Leo V. Novakoski, Reg. No. 37,198; Naomi Obinata, Reg. No. 39,320;
Thomas C. Reynolds, Reg. No. 32,488; Kenneth M. Seddon, Reg. No. 43,105; Mark Seeley, Reg. No. 32,299; Steven P.
Skabrat, Reg. No. 36,279; Howard A. Skaist, Reg. No. 36,008; Steven C. Stewart, Reg. No. 33,555; Raymond J. Werner, Reg.
No. 34,752; Robert G. Winkle, Reg. No. 37,474; and Charles K. Young, Reg. No. 39,435; my patent attorneys, and Peter Lam, 
Reg. No. P44,855; Thomas Raleigh Lane, Reg. No. 42,781; Gene I. Su, Reg. No. 45,140; and Calvin E. Wells, Reg. No. 
P43,256; my patent agents, of INTEL CORPORATION with full power of substitution and revocation, to prosecute this 
application and to transact all business in the Patent and Trademark Office connected herewith. 



Send correspondence to Eric S. Hyman, Reg. No. 30,139, BLAKELY, SOKOLOFF, TAYLOR &
ZAFMAN LLP, 12400 Wilshire Boulevard, 7th Floor, Los Angeles, California 90025 and direct telephone
calls to Eric S. Hyman, Reg. No. 30,139, (310) 207-3800.

I hereby declare that all statements made herein of my own knowledge are true and that all statements made on 
information and belief are believed to be true; and further that these statements were made with the knowledge 
that willful false statements and the like so made are punishable by fine or imprisonment, or both, under Section 
1001 of Title 18 of the United States Code and that such willful false statements may jeopardize the validity of 
the application or any patent issued thereon. 

Full Name of Sole/First Inventor (given name, family name): Darrell D. Boggs
Inventor's Signature: [signature]
Date: [illegible]
Residence (City, State): Aloha, Oregon
Citizenship (Country): USA
P. O. Address: 2200 S.W. 195th Avenue, Aloha, Oregon 97006 USA

Full Name of Second/Joint Inventor (given name, family name): Douglas M. Carmean
Inventor's Signature: [signature]
Date: [illegible]
Residence (City, State):
Citizenship (Country): USA
P. O. Address: 14815 SW Bonnie Brae, Beaverton, Oregon 97007 USA

Full Name of Third/Joint Inventor (given name, family name): Per H. Hammarlund
Inventor's Signature: [signature]
Date: [illegible]
Residence (City, State): Hillsboro, Oregon
Citizenship (Country): Sweden
P. O. Address: 2601 NE 2nd Drive, Hillsboro, Oregon 97124 USA

Full Name of Fourth/Joint Inventor (given name, family name): Francis X. McKeen
Inventor's Signature: [signature]
Date: [illegible]
Residence (City, State): Portland, Oregon
Citizenship (Country): USA
P. O. Address: 10612 NW LeMans Court, Portland, Oregon 97229 USA

Full Name of Fifth/Joint Inventor (given name, family name): David J. Sager
Inventor's Signature: [signature]
Date: [illegible]
Residence (City, State): Portland, Oregon
Citizenship (Country): USA
P. O. Address: 9540 NW Skyview Drive, Portland, Oregon 97231 USA

Title 37, Code of Federal Regulations, Section 1.56
Duty to Disclose Information Material to Patentability. 

(a) A patent by its very nature is affected with a public interest. The public interest is best served, and the most effective 
patent examination occurs when, at the time an application is being examined, the Office is aware of and evaluates the 
teachings of all information material to patentability. Each individual associated with the filing and prosecution of a patent 
application has a duty of candor and good faith in dealing with the Office, which includes a duty to disclose to the Office all 
information known to that individual to be material to patentability as defined in this section. The duty to disclose
information exists with respect to each pending claim until the claim is cancelled or withdrawn from consideration, or the
application becomes abandoned. Information material to the patentability of a claim that is cancelled or withdrawn from
consideration need not be submitted if the information is not material to the patentability of any claim remaining under
consideration in the application. There is no duty to submit information which is not material to the patentability of any
existing claim. The duty to disclose all information known to be material to patentability is deemed to be satisfied if all
information known to be material to patentability of any claim issued in a patent was cited by the Office or submitted to the 
Office in the manner prescribed by §§1.97(b)-(d) and 1.98. However, no patent will be granted on an application in 
connection with which fraud on the Office was practiced or attempted or the duty of disclosure was violated through bad 
faith or intentional misconduct. The Office encourages applicants to carefully examine: 

(1) Prior art cited in search reports of a foreign patent office in a counterpart application, and 

(2) The closest information over which individuals associated with the filing or prosecution of a patent application believe 
any pending claim patentably defines, to make sure that any material information contained therein is disclosed to the Office. 

(b) Under this section, information is material to patentability when it is not cumulative to information already of record or
being made of record in the application, and

(1) It establishes, by itself or in combination with other information, a prima facie case of unpatentability of a claim; or 

(2) It refutes, or is inconsistent with, a position the applicant takes in: 

(i) Opposing an argument of unpatentability relied on by the Office, or 

(ii) Asserting an argument of patentability.

A prima facie case of unpatentability is established when the information compels a conclusion that a claim is unpatentable 
under the preponderance of evidence, burden-of-proof standard, giving each term in the claim its broadest reasonable 
construction consistent with the specification, and before any consideration is given to evidence which may be submitted in 
an attempt to establish a contrary conclusion of patentability.

(c) Individuals associated with the filing or prosecution of a patent application within the meaning of this section are: 

(1) Each inventor named in the application; 

(2) Each attorney or agent who prepares or prosecutes the application; and 

(3) Every other person who is substantively involved in the preparation or prosecution of the application and who is 
associated with the inventor, with the assignee or with anyone to whom there is an obligation to assign the application. 

(d) Individuals other than the attorney, agent or inventor may comply with this section by disclosing information to the 
attorney, agent, or inventor.